1 Task 1: Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. We want to use the data behind this article to find out in which countries people drink the most by looking at the consumption of beer, wine, and spirits.

library(fivethirtyeight)
library(tidyverse) # for read_csv() and the wrangling/plotting below
library(skimr)     # for skim()
data(drinks)

# or download the raw CSV directly
alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

Let’s take a look at our data and find whether there are any missing values and what variables we have.

skim(alcohol_direct)
Data summary
Name alcohol_direct
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

The dataset contains one character (qualitative) variable, namely country, and four numeric variables (beer_servings, spirit_servings, wine_servings and total_litres_of_pure_alcohol), for which the summary statistics show that the data is right-skewed. There are no missing values, as can be seen from n_missing being 0 for all five variables in the dataset.

We use ggplot to graph the top 25 consuming countries for beer, wine and spirits respectively.

alcohol_direct %>% 
  slice_max(order_by = beer_servings,
            n = 25) %>%    #pick only the top 25 countries with the most beer_servings
  ggplot(aes(x = beer_servings,
             y = fct_reorder(country, beer_servings))) + # order countries by beer_servings
  geom_col(fill = "gold",
           color = "grey") +
  labs(title="Top 25 beer consuming countries",
       x = "Number of Beer Servings",
       y = "Country") +
  coord_cartesian(xlim = c(100,NA)) +
  NULL

alcohol_direct %>% 
  slice_max(order_by = wine_servings,
            n = 25) %>%
  ggplot(aes(x = wine_servings,
             y = fct_reorder(country, wine_servings))) +
  geom_col(fill = "deeppink4",    # a suitably wine-like color
           color = "grey") +
  labs(title = "Top 25 wine consuming countries",
       x = "Number of Wine Servings",
       y = "Country") +
  coord_cartesian(xlim = c(100,NA)) +
  NULL

alcohol_direct %>% 
  slice_max(order_by = spirit_servings,
            n = 25) %>%
  ggplot(aes(x = spirit_servings,
             y = fct_reorder(country,
                           spirit_servings))) +
  geom_col(fill = "cornsilk3", color = "grey") +
  labs(title = "Top 25 spirits consuming countries",
       x = "Number of Spirit Servings",
       y = "Country") +
  coord_cartesian(xlim = c(100,NA)) +
  NULL

We can infer from the above graphs…

At a high level, we observe different countries occupying the top spots for each type of alcohol.

  • Beginning with beer consumption, the top 5 spots are dominated by central and eastern European countries, with a few “surprises”. Namibia apparently has a long history of beer production: according to CNN, the first Namibian brewery opened in 1900, and many ethnic groups across Africa have their own brewing traditions. Nowadays local beer is a tourist attraction and a large source of revenue for both Namibia and Ghana, which explains their placement on the list. As for the remaining countries in the top 5, beer is known to be very inexpensive there; in the Czech Republic, some pubs and restaurants sell beer for less than water.

  • Moving to wine, France in the top spot is (arguably) highly anticipated. Bordeaux, a city in rural France, has some of the world’s finest wineries, and its wine is beloved both locally and internationally. Similar reasoning applies to Portugal: Douro Port is the country’s most famous wine, and many casual drinkers are switching to premium wines, according to wineintelligence.com. Andorra’s placement on the list is definitely worthy of comment, as its small population ranks it among the six least-populated European countries. Since Andorra is a famous tourist destination thanks to its ski resorts, the high wine consumption is mostly attributable to tourists.

  • Finally, regarding spirits, the top 5 sends mixed signals. Although we would expect eastern European countries such as Russia, the motherland of vodka, to rank high on the list, Grenada is number one, with another Caribbean country, St. Lucia, in fifth. According to jamaicaobserver.com, both countries have struggled with illicit consumption since 2014, when the WHO declared Grenada the country with the highest per-capita alcohol consumption in the Caribbean.

2 Task 2: Analysis of movies- IMDB dataset

We will look at a sample of movies taken from the Kaggle IMDB 5000 movie dataset.

movies <- read.csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <int> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <int> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <int> 760505847, 658672302, 652177271, 623279547, 533316…
## $ budget              <int> 237000000, 200000000, 150000000, 220000000, 185000…
## $ cast_facebook_likes <int> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <int> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <int> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Let’s see if there are any missing values in the dataset.

skim(movies) #no missing values (n_missing is 0 for all variables)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

Looking at the output above, we can see that n_missing is 0 for all variables, meaning that there are no missing values.

Then we want to find out if there are duplicate entries in the dataset.

#filter for duplicates
movies %>% 
  filter(duplicated(title)) %>% 
  glimpse()
## Rows: 54
## Columns: 11
## $ title               <chr> "Spider-Man 3", "Alice in Wonderland", "Oz the Gre…
## $ genre               <chr> "Action", "Adventure", "Adventure", "Drama", "Dram…
## $ director            <chr> "Sam Raimi", "Tim Burton", "Sam Raimi", "Kenneth B…
## $ year                <int> 2007, 2010, 2013, 2015, 2008, 2014, 2001, 2015, 20…
## $ duration            <int> 156, 108, 130, 105, 122, 101, 119, 94, 94, 125, 10…
## $ gross               <int> 336530303, 334185206, 234903076, 201148159, 191449…
## $ budget              <int> 258000000, 200000000, 215000000, 95000000, 3700000…
## $ cast_facebook_likes <int> 46055, 79957, 73441, 4671, 44060, 2690, 2776, 1788…
## $ votes               <int> 383071, 306336, 175413, 103749, 348007, 167089, 17…
## $ reviews             <int> 2294, 1187, 1036, 666, 1885, 839, 1598, 379, 379, …
## $ rating              <dbl> 6.2, 6.5, 6.4, 7.0, 5.2, 5.9, 5.7, 6.7, 6.7, 6.8, …

The output shows that we have 54 duplicates based on movie title. After checking the duplicates (see example below), we conclude that, while the duplicated entries are not 100% identical, they differ only in their number of votes and reviews, and only to a very small extent. We therefore remove the 54 duplicates identified above from our dataset as a next step.

#check in which way duplicate observations differ, using Cinderella as an example
movies %>% 
  filter(title=="Cinderella")
##        title genre        director year duration     gross   budget
## 1 Cinderella Drama Kenneth Branagh 2015      105 201148159 95000000
## 2 Cinderella Drama Kenneth Branagh 2015      105 201148159 95000000
##   cast_facebook_likes  votes reviews rating
## 1                4671 103737     665      7
## 2                4671 103749     666      7
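Since distinct(title, .keep_all = TRUE) keeps only the first row it encounters per title, it is worth noting which row of each duplicate pair survives. A minimal sketch with a hypothetical two-row tibble mirroring the Cinderella case:

```r
library(dplyr)

# Hypothetical duplicate pair (not the real dataset): same movie,
# marginally different vote counts, as with Cinderella above
dupes <- tibble(title = c("Cinderella", "Cinderella"),
                votes = c(103737, 103749))

# distinct() keeps the first occurrence of each title
dupes %>% distinct(title, .keep_all = TRUE)
## keeps the row with votes = 103737
```

Given that the differences are tiny, which of the two rows survives is immaterial here.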

Let’s count movies by genre and see which genre has the most entries in our dataset. Our guess would be comedy, as it is arguably the most accessible form of entertainment.

movies <- movies %>% 
  distinct(title,
           .keep_all = TRUE) #Removing the 54 duplicates

movies %>% 
  count(genre,
        sort = TRUE) #Count by genre and sort in descending order
##          genre   n
## 1       Comedy 844
## 2       Action 719
## 3        Drama 484
## 4    Adventure 281
## 5        Crime 198
## 6    Biography 135
## 7       Horror 128
## 8    Animation  35
## 9      Fantasy  26
## 10 Documentary  25
## 11     Mystery  15
## 12      Sci-Fi   7
## 13      Family   3
## 14     Musical   2
## 15     Romance   2
## 16     Western   2
## 17    Thriller   1

We now produce a table with the average gross earnings and budget (gross and budget) by genre and calculate the return on budget for each genre.

Our approach is to first create a variable named return_on_budget for every movie and then calculate its average within each genre. This better represents how profitable an average movie in each genre is. An alternative would be dividing average earnings by average budget, but that result would be skewed towards movies with bigger budgets, so it would be somewhat misleading.
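A toy example (hypothetical numbers, not from the dataset) illustrates why the two aggregations disagree: averaging per-movie returns weights every movie equally, while dividing average gross by average budget is dominated by the big-budget movie.

```r
library(dplyr)

# Two hypothetical movies in one genre: a cheap film returning 10x
# its budget and a blockbuster returning 1.5x
toy <- tibble(gross  = c(10e6, 300e6),
              budget = c(1e6,  200e6))

# Our approach: average of per-movie returns
mean(toy$gross / toy$budget)        # (10 + 1.5) / 2 = 5.75

# Alternative: ratio of averages, dominated by the blockbuster
mean(toy$gross) / mean(toy$budget)  # 155e6 / 100.5e6, about 1.54
```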

#Producing a table of each genre's average gross, budget and return
movies %>% 
  mutate(return_on_budget = gross/budget) %>%   
  group_by(genre) %>% 
    summarise(gross_avg = mean(gross),
              budget_avg = mean(budget),
              return_on_budget_avg = mean(return_on_budget)) %>% 
    arrange(desc(return_on_budget_avg))
## # A tibble: 17 × 4
##    genre        gross_avg budget_avg return_on_budget_avg
##    <chr>            <dbl>      <dbl>                <dbl>
##  1 Horror       37782310.  13804379.             86.1    
##  2 Biography    45201805.  28543696.             22.3    
##  3 Musical      92084000    3189500              18.8    
##  4 Family      149160478.  14833333.             14.1    
##  5 Documentary  17353973.   5887852.              8.70   
##  6 Western      20821884    3465000               7.06   
##  7 Fantasy      41902674.  18484615.              6.10   
##  8 Animation    98433792.  61701429.              5.01   
##  9 Comedy       42487808.  24458506.              3.70   
## 10 Romance      31264848.  25107500               3.17   
## 11 Drama        36754959.  25832605.              2.98   
## 12 Mystery      69117136.  41500000               2.90   
## 13 Adventure    94350236.  64692313.              2.44   
## 14 Crime        37601525.  26527405.              2.19   
## 15 Action       86270343.  70774558.              1.93   
## 16 Sci-Fi       29788371.  27607143.              1.58   
## 17 Thriller         2468     300000               0.00823

Horror movies, biographies and musical movies are the top 3 in terms of profitability.

Let us also find out which directors create the highest-grossing movies.

#Produce a table of directors regarding their gross
movies %>% 
  group_by(director) %>% 
  summarise(gross_total = sum(gross),
            gross_mean = mean(gross),
            gross_median = median(gross),
            gross_sd = sd(gross)) %>% 
  slice_max(order_by = gross_total,
            n = 15)
## # A tibble: 15 × 5
##    director          gross_total gross_mean gross_median   gross_sd
##    <chr>                   <dbl>      <dbl>        <dbl>      <dbl>
##  1 Steven Spielberg   4014061704 174524422.   164435221  101421051.
##  2 Michael Bay        2195443511 182953626.   168468240. 125789167.
##  3 James Cameron      1909725910 318287652.   175562880. 309171337.
##  4 Christopher Nolan  1813227576 226653447    196667606. 187224133.
##  5 George Lucas       1741418480 348283696    380262555  146193880.
##  6 Robert Zemeckis    1619309108 124562239.   100853835   91300279.
##  7 Tim Burton         1557078534 111219895.    69791834   99304293.
##  8 Sam Raimi          1443167519 180395940.   138480208  174705230.
##  9 Clint Eastwood     1378321100  72543216.    46700000   75487408.
## 10 Francis Lawrence   1358501971 271700394.   281666058  135437020.
## 11 Ron Howard         1335988092 111332341    101587923   81933761.
## 12 Gore Verbinski     1329600995 189942999.   123207194  154473822.
## 13 Andrew Adamson     1137446920 284361730    279680930. 120895765.
## 14 Shawn Levy         1129750988 102704635.    85463309   65484773.
## 15 Ridley Scott       1128857598  80632686.    47775715   68812285.

Steven Spielberg leads the ranking by total revenue; however, he is not among the directors with the highest mean revenue (gross_mean), the highest median revenue (gross_median), or the lowest standard deviation of revenues (gross_sd).

How about ratings? Let’s see how each genre is regarded by its viewers.

#Produce a table showing each genre's rating information
movies %>% 
  group_by(genre) %>% 
  summarise(rating_mean = mean(rating),
            rating_min = min(rating),
            rating_max = max(rating),
            rating_median = median(rating),
            rating_sd = sd(rating)) %>% 
  arrange(desc(rating_mean))
## # A tibble: 17 × 6
##    genre       rating_mean rating_min rating_max rating_median rating_sd
##    <chr>             <dbl>      <dbl>      <dbl>         <dbl>     <dbl>
##  1 Biography          7.11        4.5        8.9          7.2      0.760
##  2 Crime              6.92        4.8        9.3          6.9      0.853
##  3 Mystery            6.84        4.6        8.5          6.7      0.910
##  4 Musical            6.75        6.3        7.2          6.75     0.636
##  5 Drama              6.74        2.1        8.8          6.8      0.915
##  6 Documentary        6.66        1.6        8.5          7.4      1.77 
##  7 Sci-Fi             6.66        5          8.2          6.4      1.09 
##  8 Animation          6.65        4.5        8            6.9      0.968
##  9 Romance            6.65        6.2        7.1          6.65     0.636
## 10 Adventure          6.51        2.3        8.6          6.6      1.11 
## 11 Family             6.5         5.7        7.9          5.9      1.22 
## 12 Action             6.23        2.1        9            6.3      1.04 
## 13 Comedy             6.11        1.9        8.8          6.2      1.02 
## 14 Fantasy            6.08        4.3        7.9          6.2      0.953
## 15 Horror             5.79        3.6        8.5          5.85     0.987
## 16 Western            5.7         4.1        7.3          5.7      2.26 
## 17 Thriller           4.8         4.8        4.8          4.8     NA
# Graph the distribution of ratings for all movies
ggplot(movies, aes(x=rating)) +
  geom_histogram(bins = 20,
                 color = "grey",
                 fill = "lightblue") +
  labs(title = "Distribution of ratings for all movies",
       y = "Number of Movies",
       x = "Rating Score") +
  NULL

#Graph the distribution of ratings per genre (Thriller only includes 1 movie, so no graph shown)
ggplot(movies,
       aes(x = rating)) +
#Density plot to make comparison among genres easier, considering difference in sample sizes
  geom_density(color = "grey",    
               fill = "lightblue") +
  facet_wrap(~ genre) +
  labs(title = "Distribution of ratings per genre",
       y = NULL,
       x = "Rating Score") +
  NULL

#check why no curve for genre Thriller
movies %>% 
  filter(genre=="Thriller")
##       title    genre     director year duration gross budget
## 1 Locker 13 Thriller Bruce Dellis 2014       95  2468 300000
##   cast_facebook_likes votes reviews rating
## 1                2048   241      15    4.8

We only have 1 movie categorized as a thriller in the dataset, so there is no density plot for Thriller. Genres like Family, Western, Romance and Musical show multimodal distributions because they only contain a handful of movies. Setting these small-sample genres aside, the best-regarded genres are Biography, Crime and Documentary: their ratings are the most concentrated around the peak of the distribution, and those peaks sit at the highest scores. The box plots below indicate the same.

#Box plots for the same comparison, supporting the same conclusion (Biography wins) from a more quantitative perspective

ggplot(movies,
       aes(x = rating)) +
  geom_boxplot(color = "grey",    
               fill = "lightblue") +
  facet_wrap(~ genre) +
  labs(title = "Distribution of ratings per genre",
       y = NULL,
       x = "Rating Score") +
  NULL

The importance of social media nowadays could mean that the number of Facebook likes a movie's cast has is a good predictor of how much money the movie will make at the box office. We therefore examine the relationship between gross and cast_facebook_likes to find out whether this is true.

#Create a scatterplot and a trend line
ggplot(movies,
       aes(x = cast_facebook_likes,
           y = gross))+
  geom_point(alpha = 0.5) + 
  geom_smooth() +
  
  #Log scales make the relationship easier to see
  #Don't want to use scientific notation here
  scale_y_log10(labels = scales::comma) + 
  scale_x_log10(labels = scales::comma) +
  
  labs (x = "Number of Facebook Likes for Cast",
        y = "Gross Revenue of Movie",
        title = "Relationship between facebook likes of cast and gross of movie") +
  NULL

#Calculate correlation coefficient to understand strength of relationship
movies %>% 
  select(cast_facebook_likes,gross) %>% 
  cor()
##                     cast_facebook_likes gross
## cast_facebook_likes               1.000 0.208
## gross                             0.208 1.000

Looking at the scatterplot on a logarithmic scale, as well as the correlation coefficient, there is a positive relationship between the number of Facebook likes a movie's cast has and the revenue the movie makes at the box office. This can be taken as a sign that a more popular or more social-media-active cast generates higher revenues. However, with a correlation coefficient of 0.208, the relationship is not very strong.

We also examine the relationship between gross and budget to see whether the budget is likely to be a better predictor of how much money a movie will make at the box office.

#Create a scatterplot with a trend line
ggplot(movies,
       aes(x = budget,
           y = gross))+
  geom_point() + 
  scale_y_log10(labels = scales::comma) +
  scale_x_log10(labels = scales::comma) +
  geom_smooth() +
  labs (title = "Relationship between budget and revenue of movie",
        x = "Budget of Movie",
        y = "Gross Revenue of Movie") +
  NULL

#Calculate the correlation coefficient to understand strength of relationship
movies %>% 
  select(budget,gross) %>% 
  cor()
##        budget gross
## budget  1.000 0.639
## gross   0.639 1.000

There is a clear positive relationship between a movie's budget and its box-office revenue (correlation coefficient of 0.639): higher-budget movies tend to earn more. We can infer that budget is a better predictor of gross revenue than the cast's number of Facebook likes.

The third possible predictor of gross revenues of a movie could be its IMDB rating. Let’s examine this relationship with the same method and facet it by genre.

#Create a scatterplot with a trend line per genre
ggplot(movies,
       aes(x = rating,
           y = gross))+
  geom_point() + 
  geom_smooth() +
  facet_wrap(~genre) +
  scale_y_continuous(labels = scales::comma) +
  labs (title = "Relationship between the rating of a movie and its gross revenue",
        x = "Rating of Movie",
        y = "Gross Revenue of Movie") +
  NULL

#Calculate correlation coefficient to understand strength of relationship (for all genres together)
movies %>% 
  select(gross,rating) %>% 
  cor()
##        gross rating
## gross  1.000  0.274
## rating 0.274  1.000

Looking at the scatterplots above, a positive relationship between a movie's rating and its gross revenue can be inferred, especially for the genres Action, Adventure, Comedy, Crime, Drama and Horror. Since the smoothed line steepens as ratings increase for these genres, a rating increase for an already highly rated movie brings a larger revenue increase than the same rating increase for a low-rated movie. For some genres such as Animation or Documentary, however, there is no clear relationship, which hints that viewers rely on ratings more for some genres than for others when deciding whether to watch a movie.

3 Task 3: Returns of financial stocks

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))

We create a table and a bar plot that shows the number of companies per sector, in descending order.

company_sector_sum <- nyse %>% #create dataframe that counts # of companies per sector
    group_by(sector) %>%
    count(sort = TRUE) 
company_sector_sum   #show the table
## # A tibble: 12 × 2
## # Groups:   sector [12]
##    sector                    n
##    <chr>                 <int>
##  1 Finance                  97
##  2 Consumer Services        79
##  3 Public Utilities         60
##  4 Capital Goods            45
##  5 Health Care              45
##  6 Energy                   42
##  7 Technology               40
##  8 Basic Industries         39
##  9 Consumer Non-Durables    31
## 10 Miscellaneous            12
## 11 Transportation           10
## 12 Consumer Durables         8
ggplot(company_sector_sum,
       aes(x = reorder(sector, -n),
           y = n)) + #plot output from above
  theme(axis.text.x = element_text(angle = 45,  # Tilt the x-label at an angle of 45
                                   hjust = 1)) +
  geom_bar(fill = "lightblue",
           color = "grey",
           stat = "identity") +
  labs(title = "Number of companies per sector",
       x = "Sector",
       y = "Number of Companies") +
  NULL

Next, let’s choose the Dow Jones Industrial Average (DJIA) stocks and their ticker symbols and download some data. Besides the thirty stocks that make up the DJIA, we will also add SPY, an S&P 500 ETF.

djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"

#get tables that exist on the URL
library(rvest)     # for read_html() and html_nodes()
library(janitor)   # for clean_names()
library(lubridate) # for ymd()

tables <- djia_url %>% 
  read_html() %>% 
  html_nodes(css="table")

# parse HTML tables into a dataframe called djia. 
# Use purr::map() to create a list of all tables in URL
djia <- map(tables, . %>% 
               html_table(fill=TRUE)%>% 
               clean_names())

# constituents
table1 <- djia[[2]] %>% # the second table on the page contains the ticker symbols
  mutate(date_added = ymd(date_added),
         
         # if a stock is listed on NYSE, its symbol is, e.g., NYSE: MMM
         # We will get prices from yahoo finance which requires just the ticker
         
         # if the symbol matches the regex "NYSE*" (i.e., contains the NYSE prefix)
         # then we just drop the first 6 characters ("NYSE: ") of the string
         ticker = ifelse(str_detect(symbol, "NYSE*"),
                          str_sub(symbol,7,11),
                          symbol)
         )

# we need a vector of strings with just the 30 tickers + SPY
tickers <- table1 %>% 
  select(ticker) %>% 
  pull() %>% # pull() extracts the column as a character vector
  c("SPY") # and lets us add SPY, the SP500 ETF
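As a quick sanity check (a sketch with two hypothetical symbols), the str_detect()/str_sub() cleanup behaves as intended:

```r
library(stringr)

symbols <- c("NYSE: MMM", "AAPL")  # hypothetical examples

# Same logic as above: drop the 6-character "NYSE: " prefix where present
cleaned <- ifelse(str_detect(symbols, "NYSE*"),
                  str_sub(symbols, 7, 11),
                  symbols)
cleaned  # "MMM" "AAPL"
```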

Now let us download prices for all 30 DJIA constituents and the SPY ETF that tracks the S&P 500, starting from January 1, 2000.

# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once the data is downloaded, the chunk will not run again the next time you knit your Rmd

library(tidyquant) # for tq_get()

myStocks <- tickers %>% 
  tq_get(get  = "stock.prices",
         from = "2000-01-01",
         to   = Sys.Date()) %>% # Sys.Date() returns today's date
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 161,491
## Columns: 8
## Groups: symbol [31]
## $ symbol   <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM"…
## $ date     <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07, …
## $ open     <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, 4…
## $ high     <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, 4…
## $ low      <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, 4…
## $ close    <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, 4…
## $ volume   <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 2357600…
## $ adjusted <dbl> 27.2, 26.1, 26.9, 29.0, 29.6, 29.4, 28.9, 29.0, 29.0, 28.6, 2…

Given the adjusted closing prices, our first step is to calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns")

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns")

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns")
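Under the hood, periodReturn with type = "log" computes log(P_t / P_(t-1)), while type = "arithmetic" computes P_t / P_(t-1) - 1. A minimal base-R sketch on a hypothetical price series:

```r
# Hypothetical adjusted closing prices
prices <- c(100, 105, 102)

# Log returns: log(P_t / P_(t-1))
log_returns <- diff(log(prices))     # log(1.05), log(102/105)

# Arithmetic returns: P_t / P_(t-1) - 1
arith_returns <- diff(prices) / head(prices, -1)  # 0.05, then -3/105
```

For small returns the two definitions are close, since log(1 + r) is approximately r.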

We then summarise the monthly returns for each of the stocks and for SPY, including the min, max, median, mean and standard deviation.

#Create the summary dataset
myStocks_monthly_returns <- myStocks_returns_monthly %>%
  group_by(symbol) %>% 
  summarise(Mean = mean(monthly_returns),
            Minimum = min(monthly_returns),
            Maximum = max(monthly_returns),
            Median = median(monthly_returns),
            StandardDeviation = sd(monthly_returns)) %>%
  arrange(desc(Mean))

#Display the summary in table
myStocks_monthly_returns
## # A tibble: 31 × 6
##    symbol   Mean Minimum Maximum Median StandardDeviation
##    <chr>   <dbl>   <dbl>   <dbl>  <dbl>             <dbl>
##  1 AAPL   0.0269  -0.577   0.454 0.0338            0.115 
##  2 CRM    0.0263  -0.360   0.403 0.0199            0.111 
##  3 V      0.0200  -0.196   0.338 0.0255            0.0661
##  4 UNH    0.0191  -0.306   0.266 0.0219            0.0698
##  5 NKE    0.0162  -0.375   0.396 0.0162            0.0756
##  6 DOW    0.0149  -0.276   0.255 0.0367            0.111 
##  7 CAT    0.0143  -0.353   0.350 0.0133            0.0899
##  8 BA     0.0127  -0.458   0.459 0.0167            0.0926
##  9 MSFT   0.0114  -0.344   0.408 0.0172            0.0819
## 10 GS     0.0110  -0.275   0.312 0.0163            0.0925
## # … with 21 more rows

We then create a density plot for each of the stocks to see their performances.

#Create a density plot of monthly returns
ggplot(myStocks_returns_monthly,
       aes(x = monthly_returns)) + 
  geom_density() +
  facet_wrap(~symbol) +
  scale_x_continuous(labels = scales::percent_format(accuracy=1))+  #x-axis in percentage
  labs(title="Distribution of monthly returns of selected stocks",
        x = "Monthly Return",
        y="Density")+
  NULL

We can infer from the above table and plot…

  • The plot shows that most return distributions are non-normal, exhibiting a high level of kurtosis. It also shows that all of the companies have positive average returns, albeit some only marginally so. AXP has the highest maximum monthly return and AAPL the lowest minimum.

  • The riskiest stock, perhaps surprisingly, is Apple: it has the highest standard deviation (and lowest kurtosis) and the lowest minimum return of the thirty constituents and the S&P 500 tracker. This is somewhat counterintuitive because Apple also has the highest mean return of the selection, but as risk in finance is measured by volatility, we conclude that Apple is the riskiest.

  • We believe, in line with financial theory, that SPY is the least risky asset among those examined. Its distribution is heavily concentrated around a 0% return, showing the highest single-point density of the group, and it has the lowest standard deviation, as expected given the diversified nature of an index product. Thus, we conclude that SPY is the least risky of the selection.

Finally, we want to make a scatter plot that shows the expected monthly return and risk of a stock.

#Create a scatterplot with mean and sd of monthly returns
ggplot(myStocks_monthly_returns,
       aes(x = StandardDeviation,
           y = Mean)) +
  geom_point()+
  geom_smooth(method = "lm",
              se=FALSE)+
  ggrepel::geom_text_repel(aes(label = symbol)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1))+
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1))+
  labs(title="Expected monthly return of stocks and their risk",
        x = "Standard Deviation",
        y="Expected Monthly Return")+
  NULL

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

We can infer from the above plot…

  • The plot shows that, in general, the higher the mean return the higher the risk - or standard deviation. This is as expected because financial theory implies that for higher returns investors must anticipate greater volatility/risk.

  • There are several stocks that sit below the line of best fit and therefore offer returns that do not adequately compensate investors for the extra risk undertaken: DOW, CSCO and INTC. The worst offender is CSCO, which pairs a very low mean return with the fourth-highest volatility of the group; this would be a very poor investment. A few stocks beat the expected mean-volatility trade-off, such as V and UNH, offering greater returns with lower-than-expected volatility.

4 Task 4: Is inflation transitory?

A recent study by the Bank for International Settlements (BIS) claimed that the Current Inflation Spike Is Just Transitory. As the article says,

The surge in inflation seen across major economies is probably short lived because it’s confined to just a few sectors of the economy, according to the Bank for International Settlements.

New research by the BIS’s Claudio Borio, Piti Disyatat, Egon Zakrajsek and Dora Xia adds to one of the hottest debates in economics – how long the current surge in consumer prices will last. Both Federal Reserve Chair Jerome Powell and his euro-area counterpart Christine Lagarde have said the pickup is probably transitory, despite a snarled global supply chain and a spike in energy prices.

To better understand inflation, we want to use CPI and 10-year yield to produce the following graph:

Our two variables are:

cpi  <-   tq_get("CPIAUCSL", get = "economic.data",
                       from = "1980-01-01") %>% 
  rename(cpi = symbol,  # FRED data is given as 'symbol' and 'price'
         rate = price) %>% # we rename them to what they really are, e.g., cpi and rate
  
  # calculate yearly change in CPI by dividing current month by same month a year (or 12 months) earlier, minus 1
  mutate(cpi_yoy_change = rate/lag(rate, 12) - 1)

ten_year_monthly  <-   tq_get("GS10", get = "economic.data",
                       from = "1980-01-01") %>% 
  rename(ten_year = symbol,
         yield = price) %>% 
  mutate(yield = yield / 100) # original data is not given as, e.g., 0.05, but rather 5, for five percent

# we have the two dataframes-- we now need to join them, and we will use left_join()
# base R has a function merge() that does the same, but it's slow, so please don't use it

mydata <- 
  cpi %>% 
  left_join(ten_year_monthly, by="date") %>% 
  mutate(
    year = year(date), # using lubridate::year() to generate a new column with just the year
    month = month(date, label = TRUE),
    decade=case_when(
      year %in% 1980:1989 ~ "1980s",
      year %in% 1990:1999 ~ "1990s",
      year %in% 2000:2009 ~ "2000s",
      year %in% 2010:2019 ~ "2010s",
      TRUE ~ "2020s"
      )
  )
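The case_when() above spells out each decade explicitly; an equivalent, more compact sketch derives the label arithmetically (this assumes every year in the data is 1980 or later):

#Integer-divide the year by 10 to get its decade
mydata %>% 
  mutate(decade = paste0(year %/% 10 * 10, "s")) #e.g. 1983 %/% 10 * 10 = 1980, giving "1980s"

The hard-coded version is arguably easier to read, but the arithmetic version needs no updating when a new decade starts.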

Now let’s reproduce the graph.

ggplot(mydata,
       aes(x = cpi_yoy_change,
           y = yield,
           color = decade)) +
  geom_point() +
  geom_smooth(method = 'lm',    #create a trendline without confidence interval
              se = FALSE) +
  facet_wrap(~ decade,    #faceted by decade
             nrow = 5,
             scales = "free")+
  ggrepel::geom_text_repel(aes(label = format(date,    #geom_text_repel avoids overlapping labels
                                              format = "%b %Y")),
                           size = 2,
                           segment.color="transparent")+
  scale_x_continuous(labels = scales::percent)+    #x-axis as percentage
  scale_y_continuous(labels = scales::percent)+    #y-axis as percentage
  labs(x = "CPI Yearly change",
       y = "10-year Treasury Constant Maturity Rate",
       title = "How are CPI and 10-year related?",
       caption = "Data Source: FRED") +    #Add titles and a caption
  theme_bw()+    #Use the black-and-white theme
  theme(text = element_text(size = 12),
        legend.position = "none",    #Remove legend
        aspect.ratio = 1/12) + #Set the height-to-width ratio of each faceted graph
  NULL

5 Challenge 1: Replicating a chart

The task here is again to reproduce a beautiful graph. The graph is about the vaccination rates and 2020 election votes in every county of the United States and it’s from the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common?.

We have three data sources as below:

  1. Vaccination by county from the CDC
  2. 2020 Election Votes from County Presidential Election Returns 2000-2020
  3. Estimate of the population of each county
head(vaccinations %>% 
  filter(date=="09/20/2021",series_complete_12plus_pop_pct==0))
## # A tibble: 6 × 32
##   date       fips  mmwr_week recip_county     recip_state series_complete_pop_p…
##   <chr>      <chr>     <dbl> <chr>            <chr>                        <dbl>
## 1 09/20/2021 48139        38 Ellis County     TX                               0
## 2 09/20/2021 48121        38 Denton County    TX                               0
## 3 09/20/2021 48227        38 Howard County    TX                               0
## 4 09/20/2021 48387        38 Red River County TX                               0
## 5 09/20/2021 48223        38 Hopkins County   TX                               0
## 6 09/20/2021 48337        38 Montague County  TX                               0
## # … with 26 more variables: series_complete_yes <dbl>,
## #   series_complete_12plus <dbl>, series_complete_12plus_pop_pct <dbl>,
## #   series_complete_18plus <dbl>, series_complete_18plus_pop_pct <dbl>,
## #   series_complete_65plus <dbl>, series_complete_65plus_pop_pct <dbl>,
## #   completeness_pct <dbl>, administered_dose1_recip <dbl>,
## #   administered_dose1_pop_pct <dbl>, administered_dose1_recip_12plus <dbl>,
## #   administered_dose1_recip_12plus_pop_pct <dbl>, …

There are 277 counties, mainly from Texas, that show a 0% vaccination at the latest point in time. Looking at them in more detail (see above), we see that there are no values for the other variables. This indicates that there is no vaccination data for these counties available, which is why we will remove them from the dataset.

First we need to process the original dataset a little bit.

vaccinations_latest <- vaccinations %>% 
  filter(date == "09/20/2021") %>% #filter for most recent data entries and those for which vaccination rate available
  mutate(pct_vaccinated = case_when(
    recip_state %in% c("CA", "GA", "IA", "MI", "TX") ~ 
      administered_dose1_pop_pct, #use data for doses administered for certain states
    TRUE ~ series_complete_pop_pct)) %>% #fully vaccinated share for all other states
  filter(pct_vaccinated != 0) %>% #drop observations with 0% vaccination rates
  select(fips, pct_vaccinated) #keep only the columns needed for the graph

election2020_results_Trump <- election2020_results %>% #create new dataframe with total Trump votes per county
  filter(candidate == "DONALD J TRUMP", mode=="TOTAL") %>% 
  mutate(Trump_perc = candidatevotes/totalvotes) %>% #calculate the percentage of Trump votes per county
  select(state_po, county_name, fips, candidate, Trump_perc) #keep only the columns needed for the graph

chart_data <- election2020_results_Trump %>% #join dataframes
  left_join(population,
            by = "fips") %>% 
  left_join(vaccinations_latest) %>% 
  #mutate(trump_maj = ifelse(Trump_perc >= 0.5, "Yes","No")) #create variable to color graph later
  mutate(pct_vaccinated = pct_vaccinated/100) %>% 
  distinct(fips, .keep_all = TRUE) #remove 1 duplicate

head(chart_data)
## # A tibble: 6 × 7
##   state_po county_name fips  candidate      Trump_perc pop_estimate_2019
##   <chr>    <chr>       <chr> <chr>               <dbl>             <dbl>
## 1 AL       AUTAUGA     01001 DONALD J TRUMP      0.714             55869
## 2 AL       BALDWIN     01003 DONALD J TRUMP      0.762            223234
## 3 AL       BARBOUR     01005 DONALD J TRUMP      0.535             24686
## 4 AL       BIBB        01007 DONALD J TRUMP      0.784             22394
## 5 AL       BLOUNT      01009 DONALD J TRUMP      0.896             57826
## 6 AL       BULLOCK     01011 DONALD J TRUMP      0.248             10101
## # … with 1 more variable: pct_vaccinated <dbl>

We attempt to calculate the total percentage of people vaccinated. However, our result for September is lower than the 50.8% shown in the April version of the graph, so we do not add the horizontal line showing the actual share of people vaccinated.

#calculate %age of entire population vaccinated for horizontal line in chart
get_total_pct <- chart_data %>% 
  mutate(pop_vaccinated = pct_vaccinated*pop_estimate_2019) %>% 
  select(pop_vaccinated, pop_estimate_2019) %>% 
  colSums(na.rm = TRUE)

get_total_pct[1]/get_total_pct[2] #<50.8% shown in the graph from April
## pop_vaccinated 
##          0.451
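The colSums() ratio above is a population-weighted average, which base R's weighted.mean() can compute in one step. Note the two are not exactly identical: weighted.mean(..., na.rm = TRUE) also drops counties with missing vaccination data from the denominator, whereas the colSums() version keeps their population in it.

#Weighted-average alternative to the colSums() ratio
with(chart_data,
     weighted.mean(pct_vaccinated, w = pop_estimate_2019, na.rm = TRUE))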

Then we can reproduce the graph.

ggplot(chart_data,
       aes(x = Trump_perc,
           y = pct_vaccinated,
           size = pop_estimate_2019)) + #draw points based on size of population of county
  geom_point(color = "darkblue", 
             alpha = 0.4,
             na.rm = TRUE) +
  scale_size_continuous(range = c(1, 40)) + #specify scale for size of population-based points
  geom_point(size = 0.75) + #add points to show vaccination level per county
  geom_smooth(method = "lm", #add regression line
              se = FALSE,
              size=0.5) +
  annotate("rect", #change background color
           xmin = 0,
           xmax = 0.55,
           ymin = -Inf,
           ymax = Inf,
           alpha = 0.3,
           fill = "blue") +
  annotate("rect", #change background color
           xmin = 0.45,
           xmax = 1,
           ymin = -Inf,
           ymax = Inf,
           alpha = 0.3,
           fill = "red") +
  labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION PER COUNTY",
       subtitle = "(most states based on FULLY vaccinated only; CA, IA, GA, MI & TX, based on total doses administered)",
       y = "% of Total Population Vaccinated",
       x = "2020 Trump Vote %") +
  geom_hline(yintercept = 0.85, #add horizontal line
             linetype = "dashed")+
  annotate("text",
           x = 0.12,
           y = 0.865,
           label = "Herd immunity threshold (?)",
           size = 3,
           color = "blue",
           fontface = 2)+
  #geom_hline(yintercept = 0.508,
  #           linetype = "dashed")+ #no value for actual %age added
  #annotate("text",x=0.0528, y=0.534,label="ACTUAL: 50.8%",size=3, color="blue",fontface=2)+ 
  geom_hline(yintercept = 0.539, #add horizontal line
             linetype = "dashed")+
  annotate("text",
           x = 0.07,
           y = 0.549,
           label = "TARGET: 53.9%",
           size = 3,
           color = "blue",
           fontface = 2)+
  theme_bw() +
  scale_x_continuous(minor_breaks = seq(0, 1, 0.05), #change gridlines
                     breaks = seq(0, 1.05, by = 0.05),
                     expand=c(0,0), limits=c(0,1.01),
                     labels = scales::percent_format(accuracy = 5L)) + #change scale to %
  scale_y_continuous(minor_breaks = seq(0, 1, 0.05), #change gridlines
                     breaks = seq(0, 1.05, by = 0.05),
                     expand=c(0,0), limits=c(0,1.02),
                     labels = scales::percent_format(accuracy = 5L)) + #change scale to %
  theme(legend.position = "none") + #remove legend
  NULL

6 Challenge 2: Opinion polls for the 2021 German elections

The second challenge also deals with the reproduction of a sophisticated graph.

The Guardian newspaper has an election poll tracker for the upcoming German election and we will reproduce the graph of this tracker.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"

# similar graphs and analyses can be found at 
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purrr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinion polls
  slice(2:(n()-1)) %>%  # drop the first row (it repeats the variable names) and the last row (the 2017 election results)
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, in case we want to do analysis by month, week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )
glimpse(german_election_polls) #check dataset
## Rows: 250
## Columns: 16
## $ polling_firm   <chr> "2021 federal election", "Wahlkreisprognose", "Ipsos", …
## $ fieldwork_date <chr> "26 Sep 2021", "22–24 Sep 2021", "22–23 Sep 2021", "22–…
## $ samplesize     <chr> "–", "1,400", "2,000", "1,273", "2,002", "1,554", "10,0…
## $ abs            <chr> "", "–", "–", "–", "26", "–", "–", "–", "–", "–", "–", …
## $ union          <dbl> NA, 22.5, 22.0, 23.0, 22.0, 25.0, 23.0, 21.0, 21.5, 21.…
## $ spd            <dbl> NA, 25.5, 26.0, 25.0, 25.0, 26.0, 25.0, 25.0, 25.0, 25.…
## $ af_d           <dbl> NA, 11.0, 11.0, 10.0, 10.0, 10.0, 10.0, 12.0, 11.0, 11.…
## $ fdp            <dbl> NA, 12.0, 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 12.5, 11.…
## $ linke          <dbl> NA, 7.0, 7.0, 6.0, 6.0, 5.0, 6.0, 7.0, 6.5, 7.0, 6.5, 6…
## $ grune          <dbl> NA, 14.0, 16.0, 16.5, 17.0, 16.0, 16.0, 14.0, 15.0, 16.…
## $ fw             <chr> "", "–", "–", "3", "3", "–", "–", "2", "–", "–", "–", "…
## $ others         <chr> "", "8", "6", "5.5", "5", "7.5", "8", "7", "8.5", "9", …
## $ lead           <chr> "", "3", "4", "2", "3", "1", "2", "4", "3.5", "4", "3",…
## $ end_date       <date> 2021-09-26, 2021-09-24, 2021-09-23, 2021-09-23, 2021-0…
## $ month          <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ week           <dbl> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 37,…

Before plotting the graph, we first need to calculate rolling means for every party’s position in previous polls.

#Calculate the daily mean for each party (several polls can share an end date)
german_election_polls_mean <- german_election_polls %>%
  group_by(end_date) %>% 
  summarise(mean_CDU = mean(union),
            mean_SPD = mean(spd),
            mean_Grune = mean(grune),
            mean_AfD = mean(af_d),
            mean_FDP = mean(fdp),
            mean_Linke = mean(linke))

#Then calculate 14-day rolling mean
german_election_polls_rollingmean <- german_election_polls_mean %>% 
  mutate(CDU14 = zoo::rollmean(mean_CDU,
                               k = 14,
                               fill = NA,
                               align = "right"),   #Make sure the calculation is based on 14 days BEFORE the date we examine instead of AFTER or BETWEEN
         SPD14 = zoo::rollmean(mean_SPD,
                               k = 14,
                               fill = NA,
                               align = "right"),
         Grune14 = zoo::rollmean(mean_Grune,
                               k = 14,
                               fill = NA,
                               align = "right"),
         AfD14 = zoo::rollmean(mean_AfD,
                               k = 14,
                               fill = NA,
                               align = "right"),
         FDP14 = zoo::rollmean(mean_FDP,
                               k = 14,
                               fill = NA,
                               align = "right"),
         Linke14 = zoo::rollmean(mean_Linke,
                               k = 14,
                               fill = NA,
                               align = "right"))

#Just to check if everything goes right
glimpse(german_election_polls_rollingmean)
## Rows: 146
## Columns: 13
## $ end_date   <date> 2021-01-04, 2021-01-05, 2021-01-06, 2021-01-08, 2021-01-11…
## $ mean_CDU   <dbl> 36.5, 36.0, 35.0, 36.0, 36.0, 36.0, 37.0, 35.0, 35.0, 35.2,…
## $ mean_SPD   <dbl> 15.5, 15.0, 14.0, 14.0, 15.0, 15.0, 15.0, 15.0, 15.0, 14.5,…
## $ mean_Grune <dbl> 18.0, 18.0, 21.0, 20.0, 18.0, 18.0, 20.0, 20.0, 19.0, 17.2,…
## $ mean_AfD   <dbl> 10.00, 10.00, 10.00, 8.00, 10.00, 10.00, 10.00, 9.00, 9.00,…
## $ mean_FDP   <dbl> 6.75, 6.00, 7.00, 7.00, 7.50, 7.00, 5.00, 6.00, 7.00, 8.75,…
## $ mean_Linke <dbl> 7.75, 9.00, 7.00, 8.00, 8.00, 8.00, 8.00, 8.00, 8.00, 7.75,…
## $ CDU14      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 35.9, 3…
## $ SPD14      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14.9, 1…
## $ Grune14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 18.9, 1…
## $ AfD14      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 9.61, 9…
## $ FDP14      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.85, 6…
## $ Linke14    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.85, 7…

The rolling means for the first 13 recorded days are NA because there is no earlier data in this dataset. This does not affect the rest of the graph, so we can leave them as NA.
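The leading NAs follow directly from align = "right": with a window of k observations, the first k - 1 positions have no complete window behind them. A toy illustration:

#With k = 3 and right alignment, the first 2 positions are NA
zoo::rollmean(1:5, k = 3, fill = NA, align = "right") #returns NA NA 2 3 4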

#Assign a color to each party, matching the colors in the original graph as closely as possible
color <- c("CDU/CSU" = "black",
           "SPD" = "red3",
           "Grune" = "green3",
           "AfD" = "steelblue4",
           "FDP" = "goldenrod1",
           "Linke" = "mediumorchid3")

ggplot(german_election_polls,
       aes(x = end_date)) +
  
  #Graph the point with some transparency
  geom_point(aes(y = union,
             color = "CDU/CSU"),
             alpha = 0.3,
             size = 2) +
  geom_point(aes(y = spd,
             color = "SPD"),
             alpha = 0.3,
             size = 2) +
  geom_point(aes(y = grune,
             color = "Grune"),
             alpha = 0.3,
             size = 2) +
  geom_point(aes(y = af_d,
             color = "AfD"),
             alpha = 0.3,
             size = 2) +
  geom_point(aes(y = fdp,
             color = "FDP"),
             alpha = 0.3,
             size = 2) +
  geom_point(aes(y = linke,
             color = "Linke"),
             alpha = 0.3,
             size = 2) +
  
  #Graph the lines to show the trend. Note there is no line for the first 13 sample points because the 14-observation rolling average is undefined there
  geom_line(data = german_election_polls_rollingmean,
            aes(y = CDU14,
            color = "CDU/CSU"),
            size = 1.2) +
  geom_line(data = german_election_polls_rollingmean,
            aes(y = SPD14,
            color = "SPD"),
            size = 1.2) +
  geom_line(data = german_election_polls_rollingmean,
            aes(y = Grune14,
            color = "Grune"),
            size = 1.2) +
  geom_line(data = german_election_polls_rollingmean,
            aes(y = AfD14,
            color = "AfD"),
            size = 1.2) +
  geom_line(data = german_election_polls_rollingmean,
            aes(y = FDP14,
            color = "FDP"),
            size = 1.2) +
  geom_line(data = german_election_polls_rollingmean,
            aes(y = Linke14,
            color = "Linke"),
            size = 1.2) +
  
  #Set the format of the x-axis
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b %Y") +
  
  #Set the limits and break points of the y-axis
  scale_y_continuous(breaks = c(5,15,25,35,45)) +
  expand_limits(y = c(5,45)) +
  
  #Set color with pre-defined series
  scale_color_manual(name = "Party",
                     values = color) +
  labs(x = "Date",
       y = "Share of Votes (%)",
       title = "German election poll-tracker",
       subtitle = "Who will be the next German chancellor?",
       caption = "Source: Wikipedia, last updated on 26 Sep 2021") +
  theme(text = element_text(size = 14),
        legend.position = "right",
        panel.background = element_rect(fill = "white"),
        panel.grid.major.x = element_blank(),   #Remove the x-axis grid
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.5,    #Format the y-axis grid
                                          color = "grey",
                                          linetype = "dashed" )) +
  NULL

7 Details

  • Who did you collaborate with: team members of Study Group A14
  • Approximately how much time did you spend on this problem set: 7 hours on average for each team member
  • What, if anything, gave you the most trouble: formatting the plots

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes.

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.